I/O Throttling and Coordination for MapReduce
Authors
Abstract
As a leading framework for data-intensive computing, MapReduce has gained enormous popularity in large-scale data analysis. With the increasing adoption of multi-core and many-core platforms, more and more MapReduce tasks now run on the same node and share the same storage resources. This task concurrency raises the issue of I/O stream congestion. We have observed significant throughput drops and task delays caused by I/O stream congestion in the MapReduce framework. In this paper, we propose two techniques to address I/O stream congestion in MapReduce tasks. First, I/O stream throttling limits the number of concurrent I/O streams and thereby avoids throughput drops. Second, to alleviate I/O contention among multiple MapReduce jobs, I/O coordination orders the I/O streams in accordance with job priority. By exclusively granting I/O resources to streams with higher priorities, the coordination effectively shortens the average job completion time. Experimental results from Hadoop confirm that the proposed techniques improve the average job completion time by up to 33.74%. In addition, the proposed techniques greatly accelerate the execution of high-priority jobs, thereby showing that they are capable of fostering QoS in the MapReduce framework.
Keywords: I/O stream; MapReduce; I/O scheduling; throttling; coordination
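The two ideas in the abstract, capping the number of concurrent I/O streams and admitting waiting streams in priority order, can be illustrated with a small sketch. This is not the authors' Hadoop implementation, which the abstract does not describe; the class name `PriorityIOGate`, its methods, and the convention that a lower number means higher priority are all assumptions made for illustration.

```python
import heapq
import itertools
import threading


class PriorityIOGate:
    """Hypothetical gate: at most `max_streams` I/O streams run at once,
    and waiting streams are admitted highest priority first."""

    def __init__(self, max_streams):
        if max_streams < 1:
            raise ValueError("max_streams must be at least 1")
        self._max = max_streams
        self._active = 0
        self._waiting = []              # min-heap of (priority, arrival number)
        self._seq = itertools.count()   # arrival order breaks priority ties
        self._cond = threading.Condition()

    def acquire(self, priority):
        """Block until this stream may start (assumed: lower number = higher priority)."""
        with self._cond:
            ticket = (priority, next(self._seq))
            heapq.heappush(self._waiting, ticket)
            # Throttling: wait while all slots are taken.
            # Coordination: wait unless this ticket is the best one waiting.
            while self._active >= self._max or self._waiting[0] != ticket:
                self._cond.wait()
            heapq.heappop(self._waiting)
            self._active += 1
            # Another slot may still be free for the next-best waiter.
            self._cond.notify_all()

    def release(self):
        """Free a slot and wake waiters so the best one can proceed."""
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def waiting_count(self):
        """Number of streams currently blocked in acquire()."""
        with self._cond:
            return len(self._waiting)
```

A task would call `acquire(priority)` before starting a read or write phase and `release()` afterwards. With `max_streams=1` the gate grants the storage device exclusively to one stream at a time, which corresponds to the exclusive granting the abstract mentions; larger values only throttle.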
Similar resources
ThemisMR: An I/O-Efficient MapReduce
“Big Data” computing increasingly utilizes the MapReduce programming model for scalable processing of large data collections. Many MapReduce jobs are I/O-bound, and so minimizing the number of I/O operations is critical to improving their performance. In this work, we present ThemisMR, a MapReduce implementation that reads and writes data records to disk exactly twice, which is the minimum amou...
Throttling I/O Streams to Accelerate File-IO Performance
To increase the scale and performance of scientific applications, scientists commonly distribute computation over multiple processors. Often without realizing it, file I/O is parallelized with the computation. An implication of this I/O parallelization is that multiple compute tasks are likely to concurrently access the I/O nodes of an HPC system. When a large number of I/O streams concurrently...
Toward Scheduling I/O Request of Mapreduce Tasks Based on Markov Model
In cloud storage systems with multiple CPU cores, many MapReduce applications may run in parallel on each compute node, collocated with local disk storage. This disk storage is shared by multiple applications that use the full CPU power of the node. Each application tends to issue contiguous I/O requests in parallel to the same disk; however, if a large number of MapReduce tasks enters the I/O phase at th...
Thermal Attacks on Storage Systems
Disk drives are a performance bottleneck for data-intensive applications. Drive manufacturers have continued to increase the rotational speeds to meet performance requirements, but the faster drives consume more power and run hotter. Future drives will soon be operating at temperatures that threaten drive reliability. One strategy that has been proposed for increasing drive performance without ...
The Efficiency of MapReduce in Parallel External Memory
Since its introduction in 2004, the MapReduce framework has become one of the standard approaches in massive distributed and parallel computation. In contrast to its intensive use in practise, theoretical footing is still limited and only little work has been done yet to put MapReduce on a par with the major computational models. Following pioneer work that relates the MapReduce framework with ...